Below are examples illustrating the conditions and theorems from Kleinberg's paper, "An Impossibility Theorem for Clustering" (https://www.cs.cornell.edu/home/kleinber/nips15.pdf).
In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_blobs
# A 100-point, two-blob dataset for visual intuition
data, labels = make_blobs(n_samples=100, n_features=2, centers=2, cluster_std=4, random_state=2)
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='coolwarm');
In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import make_blobs
# A smaller 10-point dataset; this `data` is the one used in the remaining cells
data, labels = make_blobs(n_samples=10, n_features=2, centers=2, cluster_std=3, random_state=5)
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='coolwarm');
k-cluster stopping condition: Stop adding edges when the subgraph first consists of k connected components.
Richness condition: Every partition of S is a possible output. To state this more compactly, let Range(f) denote the set of all partitions Γ such that f(d) = Γ for some distance function d. Richness requires that Range(f) equal the set of all partitions of S. In other words, suppose we are given only the names of the points (i.e. the indices in S) but not the distances between them; for any desired partition Γ, it should be possible to construct a distance function d on S for which f(d) = Γ.
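The paper analyzes these conditions for a single-linkage-style procedure that adds edges in order of increasing weight. As a minimal sketch of the k-cluster stopping condition itself (using scipy's hierarchical clustering, which is my own choice and not part of the original notebook):

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Single linkage adds the shortest remaining edge at each step; cutting its merge
# tree at k clusters is equivalent to stopping once the graph first has k
# connected components (the k-cluster stopping condition).
Z = linkage(pdist(data), method='single')
k_cluster_labels = fcluster(Z, t=2, criterion='maxclust')   # k = 2
plt.scatter(data[:, 0], data[:, 1], c=k_cluster_labels, cmap='coolwarm');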
In [3]:
from sklearn.cluster import KMeans
# n_clusters is fixed at 1, so the only partition this model can ever output
# is "all points in one cluster"
kmeans = KMeans(n_clusters=1)
kmeans.fit(data)
plt.scatter(data[:, 0], data[:, 1], c=kmeans.labels_, cmap='coolwarm');
Given these inputs, no distance function can make this algorithm produce an arbitrary desired partition: the number of clusters is fixed in advance, so the only partition available is the single all-in-one cluster shown above.
The model therefore cannot reach the various possible partitions no matter which distance function we supply, and it fails the richness condition.
In general, algorithms with a k-cluster stopping condition are limited in the same way: they can only output partitions with exactly k blocks, never all possible partitions, as the sketch below illustrates.
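To make the richness failure concrete, here is a hedged sketch (the hand-built distance matrix and the target partition are my own construction, not part of the original notebook). With the number of clusters fixed at k = 2, a suitable distance function can produce any desired 2-block partition, but no distance function can ever make the algorithm output a partition with a different number of blocks, so Range(f) falls short of all partitions of S.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def distance_for_partition(n, blocks):
    """Distance matrix with tiny within-block and large across-block distances."""
    D = np.full((n, n), 10.0)
    for block in blocks:
        for i in block:
            for j in block:
                D[i, j] = 0.1
    np.fill_diagonal(D, 0.0)
    return D

desired = [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]     # any 2-block partition we like
D = distance_for_partition(10, desired)

# 'metric' is the current scikit-learn parameter name; older releases used 'affinity'
fixed_k = AgglomerativeClustering(n_clusters=2, metric='precomputed', linkage='complete')
print(fixed_k.fit_predict(D))    # should recover the desired 2-block partition
# No distance matrix can make this estimator emit, say, a 3-block partition:
# n_clusters is hard-wired to 2, so richness fails.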
Distance-r stopping condition: Only add edges of weight at most r.
Scale-Invariance: For any distance function d and any α > 0, we have f(d) = f(α · d).
Max r for the algorithm = 4 (passed to DBSCAN below as eps=4).
In [4]:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import DBSCAN
# A fixed absolute radius (eps = r = 4) plays the role of the distance-r stopping condition
db = DBSCAN(eps=4, metric='precomputed')
distance = euclidean_distances(data, data)
distance_times_alpha = distance * 1.1   # the same distances scaled by alpha = 1.1
Euclidean Distance * 1
In [5]:
db.fit(distance)
euc = db.labels_
plt.scatter(data[:,0], data[:,1], c = euc, cmap='coolwarm');
Euclidean Distance * 1.1
In [6]:
db.fit(distance_times_alpha)
euc_alpha = db.labels_
plt.scatter(data[:,0], data[:,1], c = euc_alpha, cmap='coolwarm');
In [7]:
# Nonzero entries mark points that the two runs classified differently.
# Scaling every pairwise distance by alpha = 1.1 changed the output, so DBSCAN
# with a fixed eps fails the scale-invariance condition.
euc - euc_alpha
Out[7]:
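As a hedged follow-up (the adjusted Rand index is my suggestion, not part of the original notebook), the disagreement can also be checked without relying on how the cluster labels happen to be numbered:

import numpy as np
from sklearn.metrics import adjusted_rand_score

print(np.flatnonzero(euc != euc_alpha))       # indices assigned differently
# A score below 1.0 confirms that multiplying every distance by alpha = 1.1
# changed the clustering, so DBSCAN with a fixed eps is not scale-invariant.
print(adjusted_rand_score(euc, euc_alpha))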
Scale-α stopping condition: Let ρ* denote the maximum pairwise distance, i.e. ρ* = max_{i,j} d(i, j). Only add edges of weight at most αρ*.
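By contrast with the fixed-eps rule above, a scale-α stopping condition sets the threshold relative to ρ*, so it is scale-invariant. A minimal sketch, reusing DBSCAN with the precomputed distances (the value α = 0.3 is an arbitrary choice of mine):

from sklearn.cluster import DBSCAN

alpha = 0.3
for scale in (1.0, 1.1, 5.0):
    d_scaled = distance * scale
    rho_star = d_scaled.max()                       # maximum pairwise distance
    db_rel = DBSCAN(eps=alpha * rho_star, metric='precomputed')
    db_rel.fit(d_scaled)
    print(scale, db_rel.labels_)
# The threshold alpha * rho_star grows with the distances, so the neighborhoods,
# and hence the labels, stay the same for every scale factor.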
Consistency condition: Let d and d′ be two distance functions. If f(d) = Γ and d′ is a Γ-transformation of d, then f(d′) = Γ. (A Γ-transformation reduces, or leaves unchanged, the distances between points in the same cluster of Γ and increases, or leaves unchanged, the distances between points in different clusters.)
In [8]:
from sklearn.cluster import AgglomerativeClustering
In [9]:
# 'metric' is the parameter name in recent scikit-learn; older releases called it 'affinity'
agg_model_euclidean = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='complete')
agg_model_euclidean.fit(data)
plt.scatter(data[:, 0], data[:, 1], c=agg_model_euclidean.labels_, cmap='coolwarm');
In [10]:
agg_model_manhattan = AgglomerativeClustering(n_clusters=2, metric='manhattan', linkage='complete')
agg_model_manhattan.fit(data)
plt.scatter(data[:, 0], data[:, 1], c=agg_model_manhattan.labels_, cmap='coolwarm');
In [11]:
# Nonzero entries mark points that the two linkage models classified differently;
# here the points at i=0 and i=3 received different labels.
agg_model_euclidean.labels_ - agg_model_manhattan.labels_
Out[11]:
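Finally, the consistency condition can be exercised directly. Below is a hedged sketch (the shrink/expand factors and the use of a precomputed distance matrix are my own choices, not the notebook's): starting from the Euclidean distances, apply a Γ-transformation and re-fit to see whether the partition survives, which is exactly what consistency demands.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import euclidean_distances

D = euclidean_distances(data, data)
agg = AgglomerativeClustering(n_clusters=2, metric='precomputed', linkage='complete')
gamma = agg.fit_predict(D)                           # the partition Γ = f(d)

# A Γ-transformation: shrink within-cluster distances, expand across-cluster ones
same_cluster = gamma[:, None] == gamma[None, :]
D_prime = np.where(same_cluster, 0.5 * D, 2.0 * D)

gamma_prime = agg.fit_predict(D_prime)               # f(d') on the transformed distances

# 1.0 means the partition was preserved, i.e. consistency held for this transformation
print(adjusted_rand_score(gamma, gamma_prime))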